A method, an apparatus and a computer program product for volumetric video encoding and video decoding

ABSTRACT

The embodiments relate to a method comprising receiving as an input a volumetric video frame comprising volumetric video data (910); decomposing the volumetric video frame into one or more patches, wherein a patch comprises a volumetric video data component (920); packing several patches, where at least two patches of the several patches comprise a different volumetric video data component with respect to each other, into one video frame (930); generating a bitstream comprising an encoded video frame (940); signaling, in or along the bitstream, existence of encoded video frame containing patches of more than one different volumetric video data component (950); and transmitting the encoded bitstream to a storage for rendering (960). The embodiments also relate to a technical equipment for implementing the method.

TECHNICAL FIELD

The present solution generally relates to volumetric video.

BACKGROUND

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with a relatively narrow field of view and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices have become available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as a 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to consider is that, compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (whether or not it is transferred from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.

SUMMARY

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided a method, comprising receiving as an input a volumetric video frame comprising volumetric video data; decomposing the volumetric video frame into one or more patches, wherein a patch comprises a volumetric video data component; packing several patches, where at least two patches of the several patches comprise a different volumetric video data component with respect to each other, into one video frame; generating a bitstream comprising an encoded video frame; signaling, in or along the bitstream, existence of encoded video frame containing patches of more than one different volumetric video data component; and transmitting the encoded bitstream to a storage for rendering.

According to a second aspect, there is provided an apparatus comprising at least means for receiving as an input a volumetric video frame comprising volumetric video data; means for decomposing the volumetric video frame into one or more patches, wherein a patch comprises a volumetric video data component; means for packing several patches, where at least two patches of the several patches comprise a different volumetric video data component with respect to each other, into one video frame; means for generating a bitstream comprising an encoded video frame; means for signaling, in or along the bitstream, existence of encoded video frame containing patches of more than one different volumetric video data component; and means for transmitting the encoded bitstream to a storage for rendering.

According to a third aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive as an input a volumetric video frame comprising volumetric video data; decompose the volumetric video frame into one or more patches, wherein a patch comprises a video data component; pack several patches, where at least two patches of the several patches comprise a different volumetric video data component with respect to each other, into one video frame; generate a bitstream comprising an encoded video frame; signal, in or along the bitstream, existence of encoded video frame containing patches of more than one different volumetric video data component; and transmit the encoded bitstream to a storage for rendering.

According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive as an input a volumetric video frame comprising volumetric video data; decompose the volumetric video frame into one or more patches, wherein a patch comprises a video data component; pack several patches, where at least two patches of the several patches comprise a different volumetric video data component with respect to each other, into one video frame; generate a bitstream comprising an encoded video frame; signal, in or along the bitstream, existence of encoded video frame containing patches of more than one different volumetric video data component; and transmit the encoded bitstream to a storage for rendering.

According to an embodiment, a volumetric video data component comprises one of the following: geometry data, attribute data.

According to an embodiment, said signaling is configured to be provided in at least one structure of a V-PCC bitstream.

According to an embodiment, the bitstream comprises a signal indicating a linkage between atlas data and packed video data.

According to an embodiment, an apparatus further comprises means for encoding a type of the video data component into a bitstream of a patch.

According to an embodiment, an apparatus further comprises means for mapping a patch to video packing regions signaled in the bitstream.

According to an embodiment, an apparatus further comprises means for indicating in a bitstream that a video bitstream contains a number of packed attributes.

According to an embodiment, the attribute comprises one of the following: texture, material identification, transparency, reflectance, normal.

According to an embodiment, an apparatus further comprises means for encoding into a bitstream an indication on how patches are differentiated and linked together.

According to an embodiment, an apparatus further comprises means for generating a structure comprising information about packing regions.

According to an embodiment, an apparatus further comprises means for encoding a video frame as separate color planes.

According to an embodiment, an apparatus further comprises means for encoding into a bitstream information about the codec being used for encoding the video frame.

According to an embodiment, an apparatus further comprises means for generating a structure identifying an encoded bitstream to which metadata is related.

According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.

DESCRIPTION OF THE DRAWINGS

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

FIG. 1 shows an example of an encoding process;

FIG. 2 shows an example of a decoding process;

FIG. 3 shows an example of a volumetric video compression process;

FIG. 4 shows an example of a volumetric video decompression process;

FIG. 5 shows an example of a visual volumetric video-based coding (3VC) bitstream;

FIG. 6 shows an example of geometry and texture packed to one video frame and patch data packed in tile groups that correspond to packed regions;

FIG. 7 shows an example of geometry and texture packed to one video frame and patch data packed in one tile group; and

FIG. 8 shows examples of packed_video( ) and packed_patches( );

FIG. 9 is a flowchart illustrating a method according to an embodiment; and

FIG. 10 shows an apparatus according to an embodiment.

DESCRIPTION OF EXAMPLE EMBODIMENTS

In the following, several embodiments will be described in the context of volumetric video encoding and decoding. In particular, the several embodiments enable packing and signaling volumetric video in one video component.

A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can un-compress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at a lower bitrate). FIG. 1 illustrates an encoding process of an image as an example. FIG. 1 shows an image to be encoded (I_(n)); a predicted representation of an image block (P′_(n)); a prediction error signal (D_(n)); a reconstructed prediction error signal (D′_(n)); a preliminary reconstructed image (I′_(n)); a final reconstructed image (R′_(n)); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (P_(inter)); intra prediction (P_(intra)); mode selection (MS) and filtering (F). An example of a decoding process is illustrated in FIG. 2. FIG. 2 illustrates a predicted representation of an image block (P′_(n)); a reconstructed prediction error signal (D′_(n)); a preliminary reconstructed image (I′_(n)); a final reconstructed image (R′_(n)); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).

Volumetric video refers to visual content that may have been captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.

Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.

Volumetric video data represents a three-dimensional scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality) and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in three-dimensional space) and respective attributes (e.g. color, opacity, reflectance, . . . ), plus any possible temporal changes of the geometry and attributes at given time instances (like frames in two-dimensional (2D) video). Volumetric video is either generated from three-dimensional (3D) models, i.e. CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, a combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Representation formats for such volumetric data comprise triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. volumetric video frames.

Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.

Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.

In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion-compensation in 3D space is an ill-defined problem, as both the geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes may be inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxels, can be projected onto one or more geometries. These geometries are “unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which may then be encoded using standard 2D video compression techniques. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).

Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency may be increased greatly. Using geometry-projections instead of prior-art 2D-video based approaches, i.e. multiview and depth, provides a better coverage of the scene (or object). Thus, 6DOF capabilities may be improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/de-compression of the projected planes. The projection and reverse projection steps are of low complexity.

FIG. 3 illustrates an overview of an example of a compression process of a volumetric video. Such a process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 301 that is provided for patch generation 302, geometry image generation 304 and texture image generation 305.

The patch generation 302 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:

-   (1.0, 0.0, 0.0),
-   (0.0, 1.0, 0.0),
-   (0.0, 0.0, 1.0),
-   (−1.0, 0.0, 0.0),
-   (0.0, −1.0, 0.0), and
-   (0.0, 0.0, −1.0)

More precisely, each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).
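As an illustration of this clustering rule, the following is a minimal sketch in Python, assuming per-point normals have already been estimated and normalized; the function name and array layout are illustrative only and are not part of any specification.

    import numpy as np

    # The six axis-aligned candidate plane normals listed above.
    PLANE_NORMALS = np.array([
        [ 1.0,  0.0,  0.0],
        [ 0.0,  1.0,  0.0],
        [ 0.0,  0.0,  1.0],
        [-1.0,  0.0,  0.0],
        [ 0.0, -1.0,  0.0],
        [ 0.0,  0.0, -1.0],
    ])

    def initial_clustering(point_normals):
        """Assign each point to the plane whose normal maximizes the dot
        product with the point normal; point_normals is an (N, 3) array."""
        scores = point_normals @ PLANE_NORMALS.T   # (N, 6) dot products
        return np.argmax(scores, axis=1)           # cluster index in [0, 5]

    # A normal pointing mostly along -y is assigned to plane index 4.
    n = np.array([[0.1, -0.95, 0.3]])
    n /= np.linalg.norm(n, axis=1, keepdims=True)
    print(initial_clustering(n))                   # -> [4]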

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.

Patch info determined at patch generation 302 for the input point cloud frame 301 is delivered to the packing process 303, to geometry image generation 304 and to texture image generation 305. The packing process 303 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every T×T (e.g. 16×16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.

The used simple packing strategy iteratively tries to insert patches into a W×H grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
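A simplified sketch of this packing strategy is shown below. It is illustrative only: patch sizes are given directly in grid blocks, the function name is an assumption, and a real encoder would track occupancy of the actual patch shapes rather than full bounding boxes.

    import numpy as np

    def pack_patches(patch_sizes, W, H):
        """Insert each (w, h) patch at the first overlap-free location in
        raster scan order; double the grid height H when nothing fits, and
        finally clip H to the used grid cells. Sizes are in grid blocks."""
        used = np.zeros((H, W), dtype=bool)
        positions = []
        for (w, h) in patch_sizes:
            placed = False
            while not placed:
                for y in range(H - h + 1):                  # raster scan
                    for x in range(W - w + 1):
                        if not used[y:y + h, x:x + w].any():
                            used[y:y + h, x:x + w] = True
                            positions.append((x, y))
                            placed = True
                            break
                    if placed:
                        break
                if not placed:                              # temporarily double H
                    used = np.vstack([used, np.zeros((H, W), dtype=bool)])
                    H *= 2
        H = int(np.nonzero(used)[0].max()) + 1 if used.any() else 0
        return positions, H

    print(pack_patches([(4, 4), (6, 2), (3, 3)], W=8, H=4))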

The geometry image generation 304 and the texture image generation 305 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

-   Geometry: W×H YUV420-8 bit,
-   Texture: W×H YUV420-8 bit.

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
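The two-layer projection described above can be sketched as follows; the dense per-pixel arrays and names used here are assumptions for illustration, with points_uv giving the pixel (u, v) that each point projects to and depths its projected depth.

    import numpy as np

    def project_two_layers(points_uv, depths, width, height, delta):
        """Per pixel, keep the lowest depth D0 (near layer) and the highest
        depth within [D0, D0 + delta] (far layer); delta is the
        user-defined surface thickness."""
        near = np.full((height, width), np.inf)
        far = np.full((height, width), np.inf)
        for (u, v), d in zip(points_uv, depths):     # near layer: min depth
            near[v, u] = min(near[v, u], d)
        for (u, v), d in zip(points_uv, depths):     # far layer: max depth
            d0 = near[v, u]                          # within the interval
            if d0 <= d <= d0 + delta and (far[v, u] == np.inf or d > far[v, u]):
                far[v, u] = d
        return near, far

    # Three points land on the same pixel; only depths 5 and 6 survive.
    near, far = project_two_layers([(0, 0)] * 3, [5.0, 6.0, 9.0], 1, 1, 2.0)
    print(near[0, 0], far[0, 0])                     # 5.0 6.0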

The geometry images and the texture images may be provided to image padding 307. The image padding 307 may also receive as an input an occupancy map (OM) 306 to be used with the geometry images and texture images. The occupancy map 306 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image in which occupied and non-occupied pixels are distinguished by their respective binary values. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 303.

The padding process 307, to which the present embodiments are related, aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of T×T (e.g. 16×16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous T×T block in raster order. If the block is full (i.e. occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. an edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
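The simple padding strategy can be sketched per T×T block as follows; the function name and the dilation-style iteration are assumptions for illustration and are not taken from any reference implementation.

    import numpy as np

    def pad_block(block, occ, prev_block):
        """Pad one T x T block: an empty block copies the last row of the
        previous block in raster order; a full block is left untouched; an
        edge block has its empty pixels filled iteratively with the average
        of their occupied neighbours."""
        if not occ.any():                               # fully empty block
            return np.tile(prev_block[-1:, :], (block.shape[0], 1))
        if occ.all():                                   # fully occupied block
            return block
        block, occ = block.astype(float).copy(), occ.copy()
        while not occ.all():                            # edge block: iterate
            for y, x in zip(*np.nonzero(~occ)):
                vals = [block[yy, xx]
                        for yy, xx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        if 0 <= yy < occ.shape[0] and 0 <= xx < occ.shape[1]
                        and occ[yy, xx]]
                if vals:                                # has a filled neighbour
                    block[y, x] = sum(vals) / len(vals)
                    occ[y, x] = True
        return block

    occ = np.zeros((4, 4), dtype=bool); occ[1:3, 1:3] = True
    blk = np.where(occ, 10.0, 0.0)
    print(pad_block(blk, occ, np.ones((4, 4))))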

The padded geometry images and padded texture images may be provided for video compression 308. The generated images/layers may be stored as video frames and compressed using for example the H.265 video codec according to the video codec configurations provided as parameters. The video compression 308 also generates reconstructed geometry images to be provided for smoothing 309, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 302. The smoothed geometry may be provided to texture image generation 305 to adapt the texture images.

The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch.

For example, the following metadata may be encoded/decoded for every patch:

-   index of the projection plane
    -   Index 0 for the planes (1.0, 0.0, 0.0) and (−1.0, 0.0, 0.0)
    -   Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, −1.0, 0.0)
    -   Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, −1.0)
-   2D bounding box (u0, v0, u1, v1)
-   3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:
    -   Index 0: δ0=x0, s0=z0 and r0=y0
    -   Index 1: δ0=y0, s0=z0 and r0=x0
    -   Index 2: δ0=z0, s0=x0 and r0=y0

Also, mapping information providing for each T×T block its associated patch index may be encoded as follows:

-   For each T×T block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.
-   The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.
-   Let I be the index of the patch which the current T×T block belongs to, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency (see the sketch after this list).
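A minimal sketch of building the candidate list L and deriving the coded position J is given below (the arithmetic coding itself is omitted); the helper name and the convention that real patch indexes start at 1, so that 0 can denote empty space, follow the description above.

    def candidate_list(block, bounding_boxes):
        """Return the ordered candidate list L for one T x T block: index 0
        (empty space) plus every patch whose 2D bounding box (u0, v0, u1, v1),
        in block units and coding order, contains the block."""
        bx, by = block
        L = [0]                                    # empty space candidate
        for idx, (u0, v0, u1, v1) in enumerate(bounding_boxes, start=1):
            if u0 <= bx < u1 and v0 <= by < v1:
                L.append(idx)
        return L

    boxes = [(0, 0, 4, 4), (2, 2, 6, 6)]           # two overlapping patches
    L = candidate_list((3, 3), boxes)
    I = 2                                          # actual patch index
    J = L.index(I)                                 # position J is coded, not I
    print(L, J)                                    # [0, 1, 2] 2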

An example of such patch auxiliary information is atlas data defined in ISO/IEC 23090-5.

The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.

The occupancy map compression 310 leverages the auxiliary information described in the previous section in order to detect the empty T×T blocks (i.e. blocks with patch index 0). The remaining blocks may be encoded as follows: the occupancy map can be encoded with a precision of B0×B0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.

The compression process may comprise one or more of the following example operations (a run-length coding sketch is given after the list):

-   Binary values may be associated with B0×B0 sub-blocks belonging to the same T×T block. A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.
-   If all the sub-blocks of a T×T block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.
-   A binary information may be encoded for each T×T block to indicate whether it is full or not.
-   If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
    -   Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
    -   The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
    -   The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy:
        -   The binary value of the initial sub-block is encoded.
        -   Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.
        -   The number of detected runs is encoded.
        -   The length of each run, except the last one, is also encoded.
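The run-length coding of the sub-block values can be sketched as follows, assuming the sub-block values are already listed in the traversal order chosen by the encoder; the function name and the returned tuple layout are illustrative assumptions.

    def encode_occupancy_block(sub_blocks):
        """Return (initial value, number of runs, run lengths without the
        last one) for the sub-block values of one non-full T x T block, as
        described above; the decoder infers the last run length."""
        runs = []
        current, length = sub_blocks[0], 0
        for v in sub_blocks:
            if v == current:
                length += 1
            else:
                runs.append(length)
                current, length = v, 1
        runs.append(length)
        return sub_blocks[0], len(runs), runs[:-1]

    # Values 0,0,1,1,1,0: initial value 0, three runs, lengths 2 and 3 coded.
    print(encode_occupancy_block([0, 0, 1, 1, 1, 0]))   # (0, 3, [2, 3])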

FIG. 4 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 401 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 402. In addition, the de-multiplexer 401 transmits a compressed occupancy map to occupancy map decompression 403. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 404. Decompressed geometry video from the video decompression 402 is delivered to geometry reconstruction 405, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 405 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.

The reconstructed geometry image may be provided for smoothing 406, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 407, which also receives a decompressed texture video from video decompression 402. The texture reconstruction 407 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.

The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u,v)=δ0+g(u,v)

s(u,v)=s0−u0+u

r(u,v)=r0−v0+v

where g(u, v) is the luma component of the geometry image.
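These three equations can be applied per pixel as in the following sketch; the tuple layout of the patch metadata is an assumption for illustration, and the resulting (depth, tangential, bitangential) triple still has to be mapped back to (x, y, z) using the projection plane index of the patch.

    def reconstruct_point(u, v, g_uv, patch):
        """Recover delta(u, v), s(u, v) and r(u, v) for the pixel (u, v)
        from the geometry luma value g(u, v) and the patch metadata
        (delta0, s0, r0, u0, v0)."""
        delta0, s0, r0, u0, v0 = patch
        depth = delta0 + g_uv        # delta(u, v) = delta0 + g(u, v)
        s = s0 - u0 + u              # s(u, v) = s0 - u0 + u
        r = r0 - v0 + v              # r(u, v) = r0 - v0 + v
        return depth, s, r

    # Patch 3D location (10, 4, 7), 2D bounding box origin (2, 3).
    print(reconstruct_point(5, 6, 12, (10, 4, 7, 2, 3)))   # (22, 7, 10)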

For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.

Visual volumetric video-based Coding (3VC) relates to a core part shared between MPEG V-PCC (Video-based Point Cloud Compression, ISO/IEC 23090-5) and MPEG MIV (MPEG Immersive Video, ISO/IEC 23090-12). At the highest level, 3VC metadata is carried in vpcc_unit, which consists of header and payload pairs. A general syntax for the vpcc_unit structure is given below:

vpcc_unit( numBytesInVPCCUnit ) {                               Descriptor
    vpcc_unit_header( )
    vpcc_unit_payload( )
    while( more_data_in_vpcc_unit )
        trailing_zero_8bits  /* equal to 0x00 */                f(8)
}

The syntax of vpcc_unit_header, as defined by the vpcc_unit, is shown below:

vpcc_unit_header( ) {                                           Descriptor
    vuh_unit_type                                               u(5)
    if( vuh_unit_type == VPCC_AVD || vuh_unit_type == VPCC_GVD ||
        vuh_unit_type == VPCC_OVD || vuh_unit_type == VPCC_AD ) {
        vuh_vpcc_parameter_set_id                               u(4)
        vuh_atlas_id                                            u(6)
    }
    if( vuh_unit_type == VPCC_AVD ) {
        vuh_attribute_index                                     u(7)
        vuh_attribute_dimension_index                           u(5)
        vuh_map_index                                           u(4)
        vuh_auxiliary_video_flag                                u(1)
    } else if( vuh_unit_type == VPCC_GVD ) {
        vuh_map_index                                           u(4)
        vuh_auxiliary_video_flag                                u(1)
        vuh_reserved_zero_12bits                                u(12)
    } else if( vuh_unit_type == VPCC_OVD || vuh_unit_type == VPCC_AD )
        vuh_reserved_zero_17bits                                u(17)
    else
        vuh_reserved_zero_27bits                                u(27)
}

vpcc_unit also defines vpcc_unit_payload, a syntax of which is presented below:

vpcc_unit_payload( ) {                                          Descriptor
    if( vuh_unit_type == VPCC_VPS )
        vpcc_parameter_set( )
    else if( vuh_unit_type == VPCC_AD )
        atlas_sub_bitstream( )
    else if( vuh_unit_type == VPCC_OVD ||
             vuh_unit_type == VPCC_GVD ||
             vuh_unit_type == VPCC_AVD )
        video_sub_bitstream( )
}

3VC metadata is contained in atlas_sub_bitstream( ), which may contain a sequence of NAL units including header and payload data. nal_unit_header( ) is used to define how to process the payload data. NumBytesInNalUnit specifies the size of the NAL unit in bytes. This value (i.e. the size of the NAL unit in bytes) is required for decoding of the NAL unit. Some form of demarcation of NAL unit boundaries may be necessary to enable inference of NumBytesInNalUnit. One such demarcation method is specified in Annex C of the V-PCC (ISO/IEC 23090-5) standard for the sample stream format.

The 3VC atlas coding layer (ACL) is specified to efficiently represent the content of the patch data. The NAL is specified to format such data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data is contained in NAL units, each of which contains an integer number of bytes. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical, except that in the sample stream format specified in Annex C of the V-PCC standard, each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.

General NAL unit syntax is presented below:

nal_unit( NumBytesInNalUnit ) {                                 Descriptor
    nal_unit_header( )
    NumBytesInRbsp = 0
    for( i = 2; i < NumBytesInNalUnit; i++ )
        rbsp_byte[ NumBytesInRbsp++ ]                           b(8)
}

nal_unit defines nal_unit_header, a syntax of which is given below:

nal_unit_header( ) {                                            Descriptor
    nal_forbidden_zero_bit                                      f(1)
    nal_unit_type                                               u(6)
    nal_layer_id                                                u(6)
    nal_temporal_id_plus1                                       u(3)
}

In the nal_unit_header( ) syntax, nal_unit_type specifies the type of the RBSP (Raw Byte Sequence Payload) data structure contained in the NAL unit as specified in Table 7.3 of the V-PCC standard. nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of nal_layer_id shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of the current version of the V-PCC standard shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0.

rbsp_byte[i] is the i-th byte of an RBSP. An RBSP is specified as an ordered sequence of bytes as follows:

The RBSP contains a string of data bits (SODB) as follows:

-   If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.
-   Otherwise, the RBSP contains the SODB as follows:
    -   The first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain.
    -   The rbsp_trailing_bits( ) syntax structure is present after the SODB as follows:
        -   The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any).
        -   The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).
        -   When the rbsp_stop_one_bit is not the last bit of a byte-aligned byte, one or more bits equal to 0 (i.e. instances of rbsp_alignment_zero_bit) are present to result in byte alignment.
    -   One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP.

Syntax structures having these RBSP properties are denoted in the syntax tables using an “_rbsp” suffix. These structures are carried within NAL units as the content of the rbsp_byte[i] data bytes. As examples of such content:

-   atlas_sequence_parameter_set_rbsp( ) is used to carry parameters related to a sequence of 3VC frames.
-   atlas_frame_parameter_set_rbsp( ) is used to carry parameters related to a specific frame, and can be applied for a sequence of frames as well.
-   sei_rbsp( ) is used to carry SEI messages in NAL units.
-   atlas_tile_group_layer_rbsp( ) is used to carry patch layout information for tile groups.

When the boundaries of the RBSP are known, the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1, and discarding any following (less significant, farther to the right) bits that follow it, which are equal to 0. The data necessary for the decoding process is contained in the SODB part of the RBSP.
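This extraction rule can be sketched as follows; the function name and the bit-string representation are illustrative assumptions, and emulation prevention bytes, which a real parser would remove first, are ignored here.

    def extract_sodb(rbsp: bytes) -> str:
        """Concatenate the bits of the RBSP bytes and drop the last bit
        equal to 1 (the rbsp_stop_one_bit) together with every bit after
        it; returns the SODB as a bit string."""
        bits = ''.join(f'{b:08b}' for b in rbsp)
        stop = bits.rfind('1')                     # position of rbsp_stop_one_bit
        return bits[:stop] if stop >= 0 else ''    # empty SODB otherwise

    # 0xA5 0x80 -> 10100101 10000000: the stop bit is the leading 1 of the
    # second byte, so the SODB is the first byte's eight bits.
    print(extract_sodb(bytes([0xA5, 0x80])))       # 10100101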

In the following, some RBSP syntaxes are presented:

Syntax atlas_tile_group_layer_rbsp is presented below:

atlas_tile_group_layer_rbsp( ) {                                Descriptor
    atlas_tile_group_header( )
    if( atgh_type != SKIP_TILE_GRP )
        atlas_tile_group_data_unit( )
    rbsp_trailing_bits( )
}

Syntax atlas_tile_group_header is presented below:

atlas_tile_group_header( ) {                                    Descriptor
    atgh_atlas_frame_parameter_set_id                           ue(v)
    atgh_address                                                u(v)
    atgh_type                                                   ue(v)
    atgh_atlas_frm_order_cnt_lsb                                u(v)
    if( asps_num_ref_atlas_frame_lists_in_asps > 0 )
        atgh_ref_atlas_frame_list_sps_flag                      u(1)
    if( atgh_ref_atlas_frame_list_sps_flag == 0 )
        ref_list_struct( asps_num_ref_atlas_frame_lists_in_asps )
    else if( asps_num_ref_atlas_frame_lists_in_asps > 1 )
        atgh_ref_atlas_frame_list_idx                           u(v)
    for( j = 0; j < NumLtrAtlasFrmEntries; j++ ) {
        atgh_additional_afoc_lsb_present_flag[ j ]              u(1)
        if( atgh_additional_afoc_lsb_present_flag[ j ] )
            atgh_additional_afoc_lsb_val[ j ]                   u(v)
    }
    if( atgh_type != SKIP_TILE_GRP ) {
        if( asps_normal_axis_limits_quantization_enabled_flag ) {
            atgh_pos_min_z_quantizer                            u(5)
            if( asps_normal_axis_max_delta_value_enabled_flag )
                atgh_pos_delta_max_z_quantizer                  u(5)
        }
        if( asps_patch_size_quantizer_present_flag ) {
            atgh_patch_size_x_info_quantizer                    u(3)
            atgh_patch_size_y_info_quantizer                    u(3)
        }
        if( afps_raw_3d_pos_bit_count_explicit_mode_flag )
            atgh_raw_3d_pos_axis_bit_count_minus1               u(v)
        if( atgh_type == P_TILE_GRP && num_ref_entries[ RlsIdx ] > 1 ) {
            atgh_num_ref_idx_active_override_flag               u(1)
            if( atgh_num_ref_idx_active_override_flag )
                atgh_num_ref_idx_active_minus1                  ue(v)
        }
    }
    byte_alignment( )
}

General atlas_tile_group_data_unit syntax is given below:

atlas_tile_group_data_unit( ) {                                 Descriptor
    p = 0
    atgdu_patch_mode[ p ]                                       ue(v)
    while( atgdu_patch_mode[ p ] != I_END &&
           atgdu_patch_mode[ p ] != P_END ) {
        patch_information_data( p, atgdu_patch_mode[ p ] )
        p++
        atgdu_patch_mode[ p ]                                   ue(v)
    }
    AtgduTotalNumberOfPatches = p
    byte_alignment( )
}

patch_information_data syntax is defined as below:

patch_information_data( patchIdx, patchMode ) {                 Descriptor
    if( atgh_type == SKIP_TILE_GRP )
        skip_patch_data_unit( patchIdx )
    else if( atgh_type == P_TILE_GRP ) {
        if( patchMode == P_SKIP )
            skip_patch_data_unit( patchIdx )
        else if( patchMode == P_MERGE )
            merge_patch_data_unit( patchIdx )
        else if( patchMode == P_INTRA )
            patch_data_unit( patchIdx )
        else if( patchMode == P_INTER )
            inter_patch_data_unit( patchIdx )
        else if( patchMode == P_RAW )
            raw_patch_data_unit( patchIdx )
        else if( patchMode == P_EOM )
            eom_patch_data_unit( patchIdx )
    }
    else if( atgh_type == I_TILE_GRP ) {
        if( patchMode == I_INTRA )
            patch_data_unit( patchIdx )
        else if( patchMode == I_RAW )
            raw_patch_data_unit( patchIdx )
        else if( patchMode == I_EOM )
            eom_patch_data_unit( patchIdx )
    }
}

patch_data_unit syntax is defined as follows:

patch_data_unit( patchIdx ) {                                   Descriptor
    pdu_2d_pos_x[ patchIdx ]                                    u(v)
    pdu_2d_pos_y[ patchIdx ]                                    u(v)
    pdu_2d_delta_size_x[ patchIdx ]                             se(v)
    pdu_2d_delta_size_y[ patchIdx ]                             se(v)
    pdu_3d_pos_x[ patchIdx ]                                    u(v)
    pdu_3d_pos_y[ patchIdx ]                                    u(v)
    pdu_3d_pos_min_z[ patchIdx ]                                u(v)
    if( asps_normal_axis_max_delta_value_enabled_flag )
        pdu_3d_pos_delta_max_z[ patchIdx ]                      u(v)
    pdu_projection_id[ patchIdx ]                               u(v)
    pdu_orientation_index[ patchIdx ]                           u(v)
    if( afps_lod_mode_enabled_flag ) {
        pdu_lod_enabled_flag[ patchIdx ]                        u(1)
        if( pdu_lod_enabled_flag[ patchIdx ] > 0 ) {
            pdu_lod_scale_x_minus1[ patchIdx ]                  ue(v)
            pdu_lod_scale_y[ patchIdx ]                         ue(v)
        }
    }
    if( asps_point_local_reconstruction_enabled_flag )
        point_local_reconstruction_data( patchIdx )
}

Annex F of the 3VC V-PCC specification (ISO/IEC 23090-5) describes different SEI messages that have been defined. SEI messages assist in processes that relate to decoding, reconstruction, display, or other purposes. Annex F defines two types of SEI messages: essential and non-essential. 3VC SEI messages are signaled in sei_rbsp( ), which is shown below:

sei_rbsp( ) {                                                   Descriptor
    do
        sei_message( )
    while( more_rbsp_data( ) )
    rbsp_trailing_bits( )
}

Non-essential SEI messages may not be required by the decoding process. Conforming decoders may not be required to process this information for output order conformance.

Specification for presence of non-essential SEI messages is also satisfied when those SEI messages (or some subset of them) are conveyed to decoders (or to the HRD (Hypothetical Reference Decoder)) by other means not specified in the 3VC V-PCC specification. When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics as specified in Annex F. When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in Annex F. For the purpose of counting bits, only the appropriate bits that are actually present in the bitstream are counted.

Essential SEI messages are an integral part of the V-PCC bitstream and should not be removed from the bitstream. The essential SEI messages may be categorized into two types:

-   Type-A essential SEI messages: These SEI messages contain information required to check bitstream conformance and for output timing decoder conformance. Every V-PCC decoder conforming to point A should not discard any relevant Type-A essential SEI messages, and shall consider them for bitstream conformance and for output timing decoder conformance.
-   Type-B essential SEI messages: V-PCC decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages, and shall consider them for 3D point cloud reconstruction and conformance purposes.

As mentioned, visual volumetric video-based coding (3VC) is a new name for a common core part between ISO/IEC 23090-5 (formerly V-PCC) and ISO/IEC 23090-12 (formerly MIV). 3VC will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text). ISO/IEC 23090-12 will refer to this common part. ISO/IEC 23090-5 will be renamed to 3VC PCC, and ISO/IEC 23090-12 will be renamed to 3VC MIV.

The 3VC bitstream structure is shown as an example in FIG. 5. 3VC HLS (High Level Syntax) is utilized to encapsulate compressed volumetric video data, whether it is encoded by the V-PCC or the MIV standard. Both standards use a similar concept of video data components plus metadata (comprising e.g. patch auxiliary information), and in both standards the video data may be encapsulated in at least two separate video bitstreams: occupancy and geometry video bitstreams in the case of V-PCC, and geometry and texture video bitstreams in the case of MIV. This separation has shown benefits in terms of compression efficiency, and especially in the case of big volumetric scenes where the amount of patch data exceeds the video encoder capability for one video frame.

However, the use of multiple components may also create a number of challenges for implementations. First, there is the common problem of how to synchronize multiple video bitstreams, especially in a situation where a decoder skips a frame (as in the current Android APIs (Application Programming Interface) documentation, for example).

If a frame from any of the video bitstreams is skipped, volumetric video should not be reconstructed, unless copies of the reconstructed volumetric video frame or the decoded video frames are available. It is however understood that this requires extra copying of the data and therefore lower performance can be expected.

Another problem occurring in the implementation is memory requirements, because video decoders may allocate maximum required memory based on video bitstream tier and level values.

In addition, some of the current multimedia frameworks do not provide an interface that would support decoding or access to more than one video bitstream. For example, the html <video> element provides access to only one decoded frame at a time, even in a case where there is more than one video track in a given file.

It is a purpose of the present embodiments to enable signaling and storage of patches from different video components in one video frame. A video component may be an occupancy, a geometry or an attribute. Examples of an attribute are texture, normal, reflectance. Thus, the purpose of the present embodiments is signaling and storage of patches from e.g. geometry and attribute (e.g. texture) in one video frame. It is also an aim to maintain compatibility within 3VC with regards to the V-PCC and MIV design. However, this should not be considered to be a limitation. The functionality provided by the present embodiments enables using only one video data component when geometry and a texture attribute are packed together, but the functionality can also be used to minimize the number of video data components carrying attributes, when a number of attributes (e.g. texture, normal, transparency) are packed together.

As shown in FIG. 5, such a 3VC bitstream may comprise a V-PCC parameter set, atlas data, occupancy video data, geometry video data, attribute video data and packed video data.

The present embodiments are implemented by means of one or more of the following:

-   1. The present embodiments define a new vuh_unit_type and a new packed_video( ) structure for vpcc_parameter_set( ). In addition, a new vpcc_unit_type is defined. The purpose of the packed_video( ) structure is to provide information about the packing regions.
-   2. The present embodiments also define a special use case where only attributes are packed in one video frame:
    -   A new identifier value is defined to indicate to a decoder that there is a number of attributes packed in one video bitstream.
    -   A new SEI message provides information about the packing regions.
-   3. The present embodiments define a new packed_patches( ) syntax structure for atlas_sequence_parameter_set( ). This structure defines constraints on the tile groups of the atlas to be aligned with regions of the packed video. Patches can be mapped based on patch index in a given tile group. The structure gives a way of interpreting patches as 2D and 3D patches.
-   4. The present embodiments also define new patch modes in patch_information_data and new patch data unit structures. The patch data type can be signaled in a patch itself, or the patch may be mapped to video regions signaled in the packed_video( ) structure (defined in item 1 above).
-   5. The present embodiments also define a new SEI message that leverages signaling separate patch layouts. Such an SEI message is introduced to atlas_sub_bitstream( ), which signals the video track containing the patch. This feature enables flexible signaling of patches of different types.

Each of the previous elements of the present embodiments is discussed in more detail in the following:

1. Vuh_Unit_Type and Packed_Video( ) Structure in Vpcc_Parameter_Set( ) Structure

The purpose of the new vuh_unit_type is to indicate that a VPCC unit contains a video bitstream containing patch data from different components. vuh_unit_type may have values from 0 to 5 as defined in the following table; however, vuh_unit_type may be composed differently.

vuh_unit_type   Identifier   V-PCC Unit Type        Description
0               VPCC_VPS     V-PCC parameter set    V-PCC level parameters
1               VPCC_AD      Atlas data             Atlas information
2               VPCC_OVD     Occupancy Video Data   Occupancy information
3               VPCC_GVD     Geometry Video Data    Geometry information
4               VPCC_AVD     Attribute Video Data   Attribute information
5               VPCC_PVD     Packed Video Data      Packed information
6...31          VPCC_RSVD    Reserved               —

For the new VPCC unit type, vpcc_parameter_set( ) provides information on how the packed video frame should be interpreted. This is accomplished by defining a new extension mechanism that is signalled by vps_packed_video_extension_flag. When this flag is set, a packed_video( ) structure is provided in vpcc_parameter_set( ). In the following, an example of the vpcc_parameter_set( ) structure is given:

vpcc_parameter_set( ) {                                         Descriptor
    profile_tier_level( )
    vps_vpcc_parameter_set_id                                   u(4)
    vps_atlas_count_minus1                                      u(6)
    for( j = 0; j < vps_atlas_count_minus1 + 1; j++ ) {
        vps_frame_width[ j ]                                    u(16)
        vps_frame_height[ j ]                                   u(16)
        ...
    }
    vps_packed_video_extension_flag                             u(1)
    if( vps_packed_video_extension_flag )
        packed_video( )
    vps_extension_present_flag                                  u(1)
    if( vps_extension_present_flag ) {
        vps_extension_length_minus1                             ue(v)
        for( j = 0; j < vps_extension_length_minus1 + 1; j++ ) {
            vps_extension_data_byte                             u(8)
        }
    }
    byte_alignment( )
}

The new packed_video( ) structure provides the linkage between the atlas and the packed video bitstreams, as well as information about the components packed in a video bitstream and how to interpret them. FIG. 8a shows an example of packed_video( ) indicating how to interpret a video component. An example of the packed_video( ) structure is shown below:

packed_video( ) {                                               Descriptor
    for( j = 0; j < vps_atlas_count_minus1 + 1; j++ ) {
        pv_packed_count_minus1[ j ]                             u(4)
        for( i = 0; i < pv_packed_count_minus1[ j ] + 1; i++ ) {
            pv_codec_id[ j ][ i ]                               u(8)
            pv_num_regions_minus1[ j ][ i ]                     u(8)
            for( k = 0; k <= pv_num_regions_minus1[ j ][ i ]; k++ ) {
                pv_region_type_id[ j ][ i ][ k ]                u(4)
                pv_region_top_left_x[ j ][ i ][ k ]             u(v)
                pv_region_top_left_y[ j ][ i ][ k ]             u(v)
                pv_region_width_minus1[ j ][ i ][ k ]           u(v)
                pv_region_height_minus1[ j ][ i ][ k ]          u(v)
            }
        }
    }
}

In the packed_video( ) structure, a definition pv_packed_count_minus1[j] plus one specifies the number of packed video bitstreams associated with the atlas with index j. pv_packed_count_minus1 shall be in the range of 0 to 15, inclusive.

In the packed_video( ) structure, a definition pv_codec_id[j][i] indicates the identifier of the codec used to compress the packed video data with index i for the atlas with index j. pv_codec_id[j][i] shall be in the range of 0 to 255, inclusive. This codec may be identified through a component codec mapping SEI message.

In the packed_video( ) structure, a definition pv_num_regions_minus1[j][i] plus 1 specifies the number of regions in a packed video bitstream with index i of the atlas with index j. pv_num_regions_minus1 shall be in the range of 0 to 254.

In the packed_video( ) structure, a definition pv_region_type_id[j][i][k] specifies the type of the region with index k in a packed video bitstream with index i of the atlas with index j. In the following, a list of possible region types is given. It is to be appreciated that the list is given for understanding purposes only, and should not be unnecessarily interpreted as a limiting list:

pv_region_type_id[ j ][ i ]   Attribute type
0                             Occupancy
1                             Geometry
2                             Texture
3                             Material ID
4                             Transparency
5                             Reflectance
6                             Normals
7..15                         Reserved

In the packed_video( ) structure, a definition pv_region_top_left_x[j][i][k] specifies the horizontal position of the top left of the k-th region in units of luma samples.

In the packed_video( ) structure, a definition pv_region_top_left_y[j][i][k] specifies the vertical position of the top left of the k-th region in units of luma samples.

In the packed_video( ) structure, a definition pv_region_width_minus1[j][i][k] plus 1 specifies the width of the k-th region in units of luma samples.

In the packed_video( ) structure, a definition pv_region_height_minus1[j][i][k] plus 1 specifies the height of the k-th region in units of luma samples.
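Taken together, these region parameters let a decoder slice a decoded packed frame back into its component parts, as in the following sketch; the region tuple layout, the type-id mapping and the function name are assumptions mirroring the semantics above.

    import numpy as np

    REGION_TYPE = {0: 'occupancy', 1: 'geometry', 2: 'texture'}

    def split_packed_frame(frame, regions):
        """Slice a decoded packed frame (H x W luma array) into component
        regions; each region is (type_id, top_left_x, top_left_y,
        width_minus1, height_minus1) in luma samples."""
        out = {}
        for type_id, x, y, w_m1, h_m1 in regions:
            out[REGION_TYPE[type_id]] = frame[y:y + h_m1 + 1, x:x + w_m1 + 1]
        return out

    # A 64x128 frame with geometry on the left and texture on the right.
    frame = np.zeros((64, 128), dtype=np.uint8)
    regions = [(1, 0, 0, 63, 63), (2, 64, 0, 63, 63)]
    print({k: v.shape for k, v in split_packed_frame(frame, regions).items()})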

According to another embodiment, packed_video( ) can also indicate in which color plane packed data is present. In this case a video frame can be encoded as separate color planes (e.g. separate_colour_plane_flag set to 1 in HEVC, see Table 6-1 of ISO/IEC 23008-2, or in VVC, see Table 2 of ISO/IEC 23090-3). This could be utilized for example for packing transparency and material ID in the same spatial region but on different planes. An example of this is given in the table below, containing a definition pv_region_plane_id[j][i][k]:

packed_video( ) {                                               Descriptor
    for( j = 0; j < vps_atlas_count_minus1 + 1; j++ ) {
        pv_packed_count_minus1[ j ]                             u(8)
        for( i = 0; i < pv_packed_count_minus1[ j ] + 1; i++ ) {
            pv_codec_id[ j ][ i ]                               u(8)
            pv_num_regions_minus1[ j ][ i ]                     u(8)
            for( k = 0; k <= pv_num_regions_minus1[ j ][ i ]; k++ ) {
                pv_region_type_id[ j ][ i ][ k ]                u(4)
                pv_region_top_left_x[ j ][ i ][ k ]             u(v)
                pv_region_top_left_y[ j ][ i ][ k ]             u(v)
                pv_region_width_minus1[ j ][ i ][ k ]           u(v)
                pv_region_height_minus1[ j ][ i ][ k ]          u(v)
                pv_region_plane_id[ j ][ i ][ k ]               u(2)
            }
        }
    }
}

In the table above, pv_region_plane_id[j][i][k] specifies the colour plane associated with the k-th region. The value of pv_region_plane_id shall be in the range of 0 to 3, inclusive. pv_region_plane_id values 1, 2 and 3 correspond to the Y, Cb and Cr planes, respectively. pv_region_plane_id value 0 indicates that all planes are associated with this region. These alternatives, which should not be unnecessarily interpreted as limiting examples, are listed below:

pv_region_plane_id   Description
0                    in all planes
1                    in “Red” plane
2                    in “Green” plane
3                    in “Blue” plane
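A sketch of selecting the plane(s) of a region according to this table is given below, assuming the decoded region is available as an (H, W, 3) array of separately coded planes; the function name is illustrative.

    import numpy as np

    def region_planes(region, plane_id):
        """Return the plane(s) of a region selected by pv_region_plane_id:
        0 means all planes; 1, 2 and 3 select the first, second and third
        coded plane, respectively."""
        if plane_id == 0:
            return region                       # data spans all planes
        return region[:, :, plane_id - 1]       # a single coded plane

    # Transparency packed in plane 2 and material ID in plane 3 of the
    # same spatial region, as in the example above.
    region = np.zeros((4, 4, 3), dtype=np.uint8)
    print(region_planes(region, 2).shape, region_planes(region, 3).shape)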

2. New Ai_Attribute_Type_Id Identifier

A new value for ai_attribute_type_id is defined to inform a decoder that the data in the attribute video bitstream contains packed attributes. In the following table, which is a non-limiting example, the new value is 5 for an identifier ATTR_PACKED.

ai_attribute_type_id[ j ][ i ]   Identifier          Attribute type
0                                ATTR_TEXTURE        Texture
1                                ATTR_MATERIAL_ID    Material ID
2                                ATTR_TRANSPARENCY   Transparency
3                                ATTR_REFLECTANCE    Reflectance
4                                ATTR_NORMAL         Normals
5                                ATTR_PACKED         Packed
6...14                           ATTR_RESERVED       Reserved
15                               ATTR_UNSPECIFIED    Unspecified

A new SEI message is further defined to provide information on how to interpret the new attribute. An example of such an SEI message is shown below:

packed_attribute( payloadSize ) {                               Descriptor
    pa_packed_attribute_count_minus1                            u(8)
    for( i = 0; i < pa_packed_attribute_count_minus1 + 1; i++ ) {
        pa_attribute_index[ i ]                                 u(8)
        pa_num_regions_minus1[ i ]                              u(8)
        for( k = 0; k <= pa_num_regions_minus1[ i ]; k++ ) {
            pa_region_type_id[ i ][ k ]                         u(4)
            pa_region_top_left_x[ i ][ k ]                      u(v)
            pa_region_top_left_y[ i ][ k ]                      u(v)
            pa_region_width_minus1[ i ][ k ]                    u(v)
            pa_region_height_minus1[ i ][ k ]                   u(v)
            pa_region_plane_id[ i ][ k ]                        u(2)
        }
    }
}

In the packed_attribute( ) structure, a definition pa_packed_attribute_count_minus1 plus one indicates the count of attributes of type packed.

In the packed_attribute( ) structure, a definition pa_attribute_index[i] indicates the attribute index of the i-th packed video bitstream to which the current SEI message refers.

In the packed_attribute( ) structure, a definition pa_num_regions_minus1[i] plus 1 specifies the number of regions in a packed video bitstream with index i. The value for pa_num_regions_minus1 shall be in the range of 0 to 254.

In the packed_attribute( ) structure, a definition pa_region_type_id[i][k] specifies the type of the region with index k in a packed video bitstream with index i. The table below describes the list of possible region types. It is to be appreciated that the list should not be unnecessarily interpreted as limiting, but the values for various attribute types may vary.

pa_region_type_id[ i ][ k ]   Attribute type
0                             Texture
1                             Material ID
2                             Transparency
3                             Reflectance
4                             Normals
5..15                         Reserved

In the packed_attribute( ) structure, a definition pa_region_top_left_x[i][k] specifies the horizontal position of the top left of the k-th region in units of luma samples.

In the packed_attribute( ) structure, a definition pa_region_top_left_y[i][k] specifies the vertical position of the top left of the k-th region in units of luma samples.

In the packed_attribute( ) structure, a definition pa_region_width_minus1[i][k] plus 1 specifies the width of the k-th region in units of luma samples.

In the packed_attribute( ) structure, a definition pa_region_height_minus1[i][k] plus 1 specifies the height of the k-th region in units of luma samples.

In the packed_attribute( ) structure, a definition pa_region_plane_id[i][k] specifies the colour plane associated with the k-th region. The value of pa_region_plane_id shall be in the range of 0 to 3, inclusive. pa_region_plane_id values 1, 2 and 3 correspond to the Y, Cb and Cr planes, respectively. pa_region_plane_id value 0 indicates that all planes are associated with this region. The table below describes the list of possible region planes. It is to be appreciated that the list should not be unnecessarily interpreted as limiting, but the values for various attribute types may vary.

pa_region_plane_id   Description
0                    in all planes
1                    in Y plane
2                    in Cb plane
3                    in Cr plane

3. New Packed_Patches( ) Syntax Structure

With the packing of different components into one video frame, atlas_sequence_parameter_set( ) provides information on how to differentiate patches and link them together. This is accomplished by defining a new extension mechanism that is signalled by aps_packed_patches_extension_flag. When this flag is set, a packed_patches( ) structure is provided in atlas_sequence_parameter_set( ).

atlas_sequence_parameter_set_rbsp( ) {            Descriptor
    asps_atlas_sequence_parameter_set_id          ue(v)
    asps_frame_width                              u(16)
    asps_frame_height                             u(16)
    asps_vui_parameters_present_flag              u(1)
    ...
    if( asps_vui_parameters_present_flag )
        vui_parameters( )
    aps_packed_patches_extension_flag             u(1)
    if( aps_packed_patches_extension_flag )
        packed_patches( )
    asps_extension_present_flag                   u(1)
    if( asps_extension_present_flag )
        while( more_rbsp_data( ) )
            asps_extension_data_flag              u(1)
    rbsp_trailing_bits( )
}

According to an embodiment, tile groups of atlas data are aligned with regions in packed video frames and are constant throughout the coded atlas sequence. In each tile group, patch indexes are independent, i.e. they are counted from 0. When aps_packed_patches_extension_flag is set to 1, the patch with index X in tile group 0 corresponds to all other patches with index X in all other tile groups with indices 1 to N.

FIG. 6 shows an example of how geometry and texture may be frame-packed into one video frame while patch data is packed in tile groups that correspond to the packed regions, and how these are signaled. FIG. 6 shows two different tile groups (tile group 0, tile group 1) for different components (geometry and texture, respectively). The patches inside the tile groups may be signaled separately, or the same patch layout may be used. In this example, the same patch identification (idx1, idx2, idx3) between these tile groups (tile group 0, tile group 1) represents the same patch, as illustrated by the sketch below.
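
For illustration, this index correspondence rule may be sketched in Python as follows; this is a non-normative sketch in which the per-tile-group patch lists are hypothetical stand-ins for decoded atlas data, with tile group 0 assumed to carry the geometry (3D) patches.

def link_patches(tile_groups):
    """Pair each 3D patch in tile group 0 with the patches sharing its
    index in the remaining tile groups (texture, etc.), per FIG. 6."""
    linked = []
    for idx, patch_3d in enumerate(tile_groups[0]):
        companions = [tg[idx] for tg in tile_groups[1:]]  # same index X everywhere
        linked.append((patch_3d, companions))
    return linked

# Hypothetical usage: geometry patches idx1..idx3 paired with texture patches.
geometry = ["geo_idx1", "geo_idx2", "geo_idx3"]   # tile group 0
texture = ["tex_idx1", "tex_idx2", "tex_idx3"]    # tile group 1
pairs = link_patches([geometry, texture])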

Attribute (texture) related patches should not contain any information (e.g. the value should be 0) related to 3D (e.g. the fields pdu_3d_pos_x, pdu_3d_pos_y, pdu_3d_pos_min_z, pdu_3d_pos_delta_max_z, pdu_lod), and a decoder should not try to interpret that information.

It is to be appreciated that when there is more than one video bitstream in a VPCC sequence, and each of the video bitstreams would contain a different packing, then other methods can be utilized to carry a mixed patch packing layout.

packed_patches( ) {                                   Descriptor
    pp_tile_groups_count_minus1                       u(8)
    for( i = 0; i < pp_tile_groups_count_minus1 + 1; i++ ) {
        pp_tile_group_id[ i ]                         u(8)
        pp_tile_group_type[ i ]                       u(8)
    }
}

In the packed_patches( ) structure, pp_tile_groups_count_minus1 plus 1 specifies the number of tile groups.

In the packed_patches( ) structure, pp_tile_group_id[ i ] specifies the tile group identifier.

In the packed_patches( ) structure, pp_tile_group_type[ i ] specifies the type of patches in the tile group with identifier pp_tile_group_id[ i ]. An example of values for pp_tile_group_type is given in the table below, which should not be unnecessarily interpreted as limiting:

pp_tile_group_type   Description
0                    3D patches: all data provided by patch_information_data( )
                     should be interpreted by the decoder
1                    2D patches: only data related to the position and orientation
                     of the patch in the video frame should be interpreted by the
                     decoder (the 3D information is provided by patches in a tile
                     group of type 0)

According to an embodiment, tile groups of atlas data may be aligned with regions in packed video frames and may be constant throughout the coded atlas sequence. Tile groups with no patch data may also exist. The packed_patches( ) structure provides information on how to interpret patches and from where patches can be copied, if needed. FIG. 8b shows an example of packed_patches( ) information indicating how to interpret atlas data.

FIG. 7 illustrates an example of geometry and texture packed into one video frame with patch data packed in one tile group. Regions can copy data from other tile groups. The difference of FIG. 7 compared to FIG. 6 is that the patches between the tile groups (tile group 0, tile group 1) share the same layout: a patch in the geometry tile group (tile group 0) is found at the same position in the texture. This allows signaling an empty tile group, i.e. SKIP_TILE_GRP, instead of explicitly signaling the position of the patch in the other tile group.

It is to be noticed that when there is more than one video bitstream in a VPCC sequence, and each of the video bitstreams contains a different packing, then other methods can be utilized to carry a mixed patch packing layout.

In the following, an example of the packed_patches( ) structure is shown:

packed_patches( ) {                                   Descriptor
    pp_tile_groups_count_minus1                       u(8)
    for( i = 0; i < pp_tile_groups_count_minus1 + 1; i++ ) {
        pp_tile_group_id[ i ]                         u(8)
        pp_tile_group_type[ i ]                       u(8)
        if( pp_tile_group_type[ i ] == 2 ) {
            pp_source_tile_group_id[ i ]
            pp_copy_type[ i ]
        }
    }
}

In the packed_patches( ) structure, pp_tile_groups_count_minus1 plus 1 specifies the number of tile groups.

In the packed_patches( ) structure, pp_tile_group_id[ i ] specifies the tile group identification.

In the packed_patches( ) structure, pp_tile_group_type[ i ] specifies the type of patches in the tile group with identifier pp_tile_group_id[ i ]. Non-limiting examples of possible pp_tile_group_type values are given below:

pp_tile_group_type   Description
0                    3D patches: all data provided by patch_information_data( )
                     should be interpreted by the decoder
1                    2D patches: only data related to the position and orientation
                     of the patch in the video frame should be interpreted by the
                     decoder (the 3D information is provided by patches in a tile
                     group of type 0)
2                    Tile group does not contain any patch data. Patch data should
                     be copied from the other tile group indicated by
                     pp_source_tile_group_id

In the packed_patches( ) structure, pp_source_tile_group_id[ i ] specifies an identification for the tile group from which patch data should be copied for the tile group having an identification equal to pp_tile_group_id.

In the packed_patches( ) structure, pp_copy_type[ i ] specifies what type of information should be copied from the source tile group and how the tile group with identification equal to pp_tile_group_id shall be interpreted. A sketch of this copying is given after the table below:

pp_copy_type   Description
0              copy all data
1              copy only 2D related data
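
By way of illustration, a decoder-side interpretation of this copy signalling may be sketched in Python as follows; the tile-group dictionaries and their field names are hypothetical stand-ins for the decoded packed_patches( ) fields.

def resolve_tile_group(tg, tile_groups_by_id):
    """Return the patch list for a tile group, honouring the copy signalling."""
    if tg["type"] != 2:               # types 0 and 1 carry their own patch data
        return tg["patches"]
    source = tile_groups_by_id[tg["source_tile_group_id"]]
    if tg["copy_type"] == 0:          # pp_copy_type 0: copy all data
        return [dict(p) for p in source["patches"]]
    # pp_copy_type 1: copy only the 2D placement; 3D data stays with the source
    return [{"pos_2d": p["pos_2d"], "size_2d": p["size_2d"]}
            for p in source["patches"]]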

According to another embodiment, the packed_patches( ) information is carried in atlas_frame_parameter_set and can change on a frame-to-frame basis.

4. New Patch Modes in Patch Information Data and New Patch Data Unit Structures

In this example, a new vuh_unit_type VPCC_PVD is defined to indicate that a VPCC unit contains a video bitstream carrying patch data from different components.

With the definition of the new VPCC unit type, vpcc_parameter_set( ) contains a packed_video( ) structure that provides information only about the codec being used for encoding the video. A non-limiting example of a structure of packed_video( ) is given in the following:

packed_video( ) {                                              Descriptor
    for( j = 0; j < vps_atlas_count_minus1 + 1; j++ ) {
        pv_packed_count_minus1[ j ]                            u(4)
        for( i = 0; i < pv_packed_count_minus1[ j ] + 1; i++ ) {
            pv_codec_id[ j ][ i ]                              u(8)
        }
    }
}

In the packed_video( ) structure, pv_packed_count_minus1[ j ] plus 1 specifies the number of packed video bitstreams associated with the atlas with index j. pv_packed_count_minus1 shall be in the range of 0 to 15, inclusive.

In the packed_video( ) structure, pv_codec_id[ j ][ i ] indicates the identifier of the codec used to compress the packed video data with index i for the atlas with index j. pv_codec_id[ j ][ i ] shall be in the range of 0 to 255, inclusive. This codec may be identified through a component codec mapping SEI message or through other means.

To indicate the components to which the patches belong, and the relation between patches, new patch modes and new patch unit structures are defined.

The following table lists patch mode types for I_TILE_GRP type atlas tile groups:

atgdu_patch_mode   Identifier   Description
0                  I_INTRA      Non-predicted Patch mode
1                  I_RAW        RAW Point Patch mode
2                  I_EOM        EOM Point Patch mode
3                  I_PACKED     Packed Patch mode
4-13               I_RESERVED   Reserved modes
14                 I_END        Patch termination mode

The following table lists patch mode types for P_TILE_GRP type atlas tile groups:

atgdu_patch_mode   Identifier       Description
0                  P_SKIP           Patch Skip mode
1                  P_MERGE          Patch Merge mode
2                  P_INTER          Inter predicted Patch mode
3                  P_INTRA          Non-predicted Patch mode
4                  P_RAW            RAW Point Patch mode
5                  P_EOM            EOM Point Patch mode
6                  P_INTER_PACKED   Packed Patch mode
7                  P_INTRA_PACKED   Packed Patch mode
8-13               P_RESERVED       Reserved modes
14                 P_END            Patch termination mode

A non-limiting example of a structure of patch_information_data( ) is given in the following:

patch_information_data( patchIdx, patchMode ) {            Descriptor
    if( atgh_type == SKIP_TILE_GRP )
        skip_patch_data_unit( patchIdx )
    else if( atgh_type == P_TILE_GRP ) {
        if( patchMode == P_SKIP )
            skip_patch_data_unit( patchIdx )
        else if( patchMode == P_MERGE )
            merge_patch_data_unit( patchIdx )
        else if( patchMode == P_INTRA )
            patch_data_unit( patchIdx )
        else if( patchMode == P_INTER )
            inter_patch_data_unit( patchIdx )
        else if( patchMode == P_RAW )
            raw_patch_data_unit( patchIdx )
        else if( patchMode == P_EOM )
            eom_patch_data_unit( patchIdx )
        else if( patchMode == P_INTRA_PACKED )
            packed_patch_data_unit( patchIdx )
        else if( patchMode == P_INTER_PACKED )
            inter_packed_patch_data_unit( patchIdx )
    }
    else if( atgh_type == I_TILE_GRP ) {
        if( patchMode == I_INTRA )
            patch_data_unit( patchIdx )
        else if( patchMode == I_RAW )
            raw_patch_data_unit( patchIdx )
        else if( patchMode == I_EOM )
            eom_patch_data_unit( patchIdx )
        else if( patchMode == I_PACKED )
            packed_patch_data_unit( patchIdx )
    }
}

A non-limiting example of a structure of packed_patch_data_unit( ) is given in the following:

packed_patch_data_unit( patchIdx ) {              Descriptor
    ppdu_2d_pos_x[ patchIdx ]                     u(v)
    ppdu_2d_pos_y[ patchIdx ]                     u(v)
    ppdu_2d_delta_size_x[ patchIdx ]              se(v)
    ppdu_2d_delta_size_y[ patchIdx ]              se(v)
    ppdu_3d_info_tile_group_id[ patchIdx ]        u(8)
    ppdu_3d_patch_index[ patchIdx ]               u(8)
    ppdu_data_type_id[ patchIdx ]                 u(8)
}

Semantics for ppdu_2d_pos_x, ppdu_2d_pos_y, ppdu_2d_delta_size_x and ppdu_2d_delta_size_y are the same as in patch_data_unit( ) and provide the 2D position of a patch.

In the packed_patch_data_unit( ) structure, ppdu_3d_info_tile_group_id[ j ] specifies the id of the tile group in which the related patch data with 3D information is present.

In the packed_patch_data_unit( ) structure, ppdu_3d_patch_index[ j ] specifies the index of the patch, in the tile group indicated by ppdu_3d_info_tile_group_id[ j ], that contains the related patch data with 3D information. A sketch of this resolution is given after the table below.

In the packed_patch_data_unit( ) structure, ppdu_data_type_id[ j ] specifies the type of data the patch contains in a packed video bitstream. The table below describes the list of possible region types, which should not be unnecessarily interpreted as limiting examples. It is also to be noticed that geometry data may be described by the standard patch_data_unit( ).

ppdu_data_type_id   Data type
0                   Occupancy
1                   Texture
2                   Material ID
3                   Transparency
4                   Reflectance
5                   Normals
6..15               Reserved
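
For illustration, the resolution of the referenced 3D information may be sketched in Python as follows; the dictionaries are hypothetical stand-ins for decoded patch data units.

def resolve_3d_info(ppdu, tile_groups_by_id):
    """Combine a packed patch's own 2D placement with the 3D information of
    the patch referenced via ppdu_3d_info_tile_group_id / ppdu_3d_patch_index."""
    group = tile_groups_by_id[ppdu["3d_info_tile_group_id"]]
    source = group["patches"][ppdu["3d_patch_index"]]
    return {
        "pos_2d": (ppdu["pos_x"], ppdu["pos_y"]),  # carried by the packed patch
        "data_type_id": ppdu["data_type_id"],      # e.g. 1 = Texture
        "pos_3d": source["pos_3d"],                # taken from the referenced patch
    }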

A non-limiting example of a structure of inter_packed_patch_data_unit( ) is given in the following:

inter_packed_patch_data_unit( patchIdx ) {        Descriptor
    ippdu_2d_pos_x[ patchIdx ]                    u(v)
    ippdu_2d_pos_y[ patchIdx ]                    u(v)
    ippdu_2d_delta_size_x[ patchIdx ]             se(v)
    ippdu_2d_delta_size_y[ patchIdx ]             se(v)
    ippdu_3d_info_tile_group_id[ patchIdx ]       u(8)
    ippdu_3d_patch_index[ patchIdx ]              u(8)
    ippdu_data_type_id[ patchIdx ]                u(8)
}

Semantics for ippdu_2d_pos_x, ippdu_2d_pos_y, ippdu_2d_delta_size_x and ippdu_2d_delta_size_y are the same as in inter_patch_data_unit( ) and provide the 2D position of a patch.

Semantics for ippdu_3d_info_tile_group_id, ippdu_3d_patch_index and ippdu_data_type_id are the same as in packed_patch_data_unit( ).

According to an embodiment, the ippdu_data_type_id and ppdu_data_type_id fields are not present in inter_packed_patch_data_unit( ) and packed_patch_data_unit( ), respectively. The type of data (e.g. color, geometry, etc.) in a given patch may instead be deduced based on the region the patch is in. Regions of a video frame and their types are signalled in the packed_video( ) structure, which is part of vpcc_parameter_set( ), as sketched after the structure below.

It is to be noticed that the structures and semantics of the packed_video( ) are the same as in the previous embodiments.

packed_video( ) {                                              Descriptor
    for( j = 0; j < vps_atlas_count_minus1 + 1; j++ ) {
        pv_packed_count_minus1[ j ]                            u(4)
        for( i = 0; i < pv_packed_count_minus1[ j ] + 1; i++ ) {
            pv_codec_id[ j ][ i ]                              u(8)
            pv_num_regions_minus1[ j ][ i ]                    u(8)
            for( k = 0; k <= pv_num_regions_minus1[ j ][ i ]; k++ ) {
                pv_region_type_id[ j ][ i ][ k ]               u(4)
                pv_region_top_left_x[ j ][ i ][ k ]            u(v)
                pv_region_top_left_y[ j ][ i ][ k ]            u(v)
                pv_region_width_minus1[ j ][ i ][ k ]          u(v)
                pv_region_height_minus1[ j ][ i ][ k ]         u(v)
            }
        }
    }
}
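
A non-normative Python sketch of the type deduction is given below; the region records mirror the packed_video( ) fields above, and the containment test assumes a patch's 2D origin falls inside exactly one region.

def deduce_patch_type(patch_x, patch_y, regions):
    """Return pv_region_type_id of the region containing the patch origin."""
    for region in regions:
        x0, y0 = region["top_left_x"], region["top_left_y"]
        if (x0 <= patch_x < x0 + region["width"]
                and y0 <= patch_y < y0 + region["height"]):
            return region["type_id"]
    raise ValueError("patch lies outside all signalled regions")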

5. New SEI Message for Flexible Metadata Signalling

According to an embodiment, a new SEI message may be introduced in atlas_sub_bitstream, which identifies the video encoded bitstream to which the following or preceding metadata relates. As an example, the new SEI message may precede patch layout related information such as atlas_tile_group_layer_rbsp( ), in which case the following NAL units are applied to the specific video component, attribute or atlas that is contained in another video encoded bitstream. This design allows flexible storage of patches of a certain type in another video track of another type. A non-limiting example of a structure of separate_atlas_component is given in the following:

separate_atlas_component( payloadSize ) {         Descriptor
    component_type                                u(5)
    attribute_index                               u(7)
    atlas_id                                      u(6)
    metadata_component_type                       u(5)
    metadata_attribute_index                      u(7)
    metadata_atlas_id                             u(6)
}

In the separate_atlas_component structure, component_type signals the video encoded component type, which is needed to identify a specific video encoded bitstream.

In the separate_atlas_component structure, attribute_index signals the video encoded attribute type, which is needed to identify a specific video encoded bitstream.

In the separate_atlas_component structure, atlas_id signals the video encoded atlas id, which is needed to identify a specific video encoded bitstream.

In the separate_atlas_component structure, metadata_component_type shall signal the component type to which the following or preceding NAL units should be applied. When signalling patches, this parameter identifies the type of the patch or patch tile group.

In the separate_atlas_component structure, metadata_attribute_index shall signal the attribute index to which the following or preceding NAL units should be applied. When signalling patches, this parameter identifies the attribute index of the patch or patch tile group. This information is optional and is only needed if patches from different attribute indices should be packed together.

In the separate_atlas_component structure, metadata_atlas_id shall signal the atlas index to which the following or preceding NAL units should be applied. When signalling patches, this parameter identifies the atlas index of the patch or patch tile group. This information is optional and should be used only if atlases with different indices should be packed in the same video encoded bitstream.
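
A non-normative Python sketch of how a decoder might act on this SEI message is given below; the stream registry and the NAL unit lists are hypothetical illustration structures, and the routing shown is only one possible interpretation of the signalling.

def apply_separate_atlas_component(sei, following_nal_units, registry):
    """Route the NAL units following the SEI to the video encoded bitstream
    identified by (component_type, attribute_index, atlas_id), recording the
    component/attribute/atlas they describe via the metadata_* fields."""
    stream_key = (sei["component_type"], sei["attribute_index"],
                  sei["atlas_id"])
    metadata_key = (sei["metadata_component_type"],
                    sei["metadata_attribute_index"],
                    sei["metadata_atlas_id"])
    registry.setdefault(stream_key, {}).setdefault(metadata_key, []) \
            .extend(following_nal_units)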

A method according to an embodiment is shown in FIG. 9. The method generally comprises receiving 910 as an input a volumetric video frame comprising volumetric video data; decomposing 920 the volumetric video frame into one or more patches, wherein a patch comprises a volumetric video data component; packing 930 several patches, where at least two patches of the several patches comprise a different volumetric video data component with respect to each other, into one video frame; generating 940 a bitstream comprising an encoded video frame; signaling 950, in or along the bitstream, existence of an encoded video frame containing patches of more than one different volumetric video data component; and transmitting 960 the encoded bitstream to a storage for rendering.
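
A non-normative, end-to-end Python sketch of the ordering of steps 910 to 960 is given below; every structure in it is a hypothetical stand-in, and a real implementation would use an actual video encoder together with the signalling structures described above.

def encode_volumetric_frame(volumetric_frame):
    # 920: decompose into patches, one per volumetric video data component here
    patches = [{"component": name, "data": data}
               for name, data in volumetric_frame["components"].items()]
    # 930: pack patches of different components into one video frame
    video_frame = {"patches": patches}
    # 940: generate a bitstream with the encoded frame (stand-in for e.g. HEVC)
    bitstream = {"frames": [video_frame]}
    # 950: signal, in or along the bitstream, that the frame mixes components
    bitstream["packed_signalling"] = sorted({p["component"] for p in patches})
    # 960: the bitstream would then be transmitted to a storage for rendering
    return bitstream

example = encode_volumetric_frame(
    {"components": {"geometry": b"\x00", "texture": b"\x01"}})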

An apparatus according to an embodiment comprises means for receiving as an input a volumetric video frame comprising volumetric video data; means for decomposing the volumetric video frame into one or more patches, wherein a patch comprises a volumetric video data component; means for packing several patches, where at least two patches of the several patches comprise a different volumetric video data component with respect to each other, into one video frame; means for generating a bitstream comprising an encoded video frame; means for signaling, in or along the bitstream, existence of an encoded video frame containing patches of more than one different volumetric video data component; and means for transmitting the encoded bitstream to a storage for rendering. The means comprise at least one processor and a memory including computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method according to various embodiments.

An example of an apparatus is disclosed with reference to FIG. 10. FIG. 10 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may also be comprised at a local or a remote server or a graphics processing unit of a computer. The device may also be comprised as part of a head-mounted display device. The apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as a solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.

The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.

The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises receiving as an input a volumetric video frame comprising volumetric video data; decomposing the volumetric video frame into one or more patches, wherein a patch comprises a volumetric video data component; packing several patches, where at least two patches of the several patches comprise a different volumetric video data component with respect to each other, into one video frame; generating a bitstream comprising an encoded video frame; signaling, in or along the bitstream, existence of an encoded video frame containing patches of more than one different volumetric video data component; and transmitting the encoded bitstream to a storage for rendering.

A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.

1-15. (canceled)
16. A method, comprising: receiving, as an input, a volumetric video frame comprising volumetric video data; decomposing the volumetric video frame into one or more patches, wherein a patch comprises a volumetric video data component; packing two or more patches, wherein at least two patches of the two or more patches comprise a different volumetric video data component with respect to each other, into one video frame; generating a bitstream comprising an encoded video frame; signaling, in or along the bitstream, existence of the encoded video frame comprising patches of more than one different volumetric video data component; and transmitting the encoded bitstream for rendering.
 17. The method according to claim 16, wherein the volumetric video data component comprises one of the following: geometry data, or attribute data.
 18. The method according to claim 16, wherein said signaling is configured to be provided in at least one structure of a video-based point cloud compression bitstream.
 19. The method according to claim 16, wherein the bitstream comprises a signal indicating a linkage between atlas data and packed video data.
 20. The method according to claim 16, further comprising encoding a type of the volumetric video data component into atlas data.
 21. The method according to claim 16, further comprising mapping a patch to video frame packing regions signaled in the bitstream.
 22. The method according to claim 16, further comprising indicating in the bitstream that a video frame comprises patches comprising more than one attribute data.
23. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive as an input a volumetric video frame comprising volumetric video data; decompose the volumetric video frame into one or more patches, wherein a patch comprises a volumetric video data component; pack two or more patches, where at least two patches of the two or more patches comprise a different volumetric video data component with respect to each other, into one video frame; generate a bitstream comprising an encoded video frame; signal, in or along the bitstream, existence of the encoded video frame comprising patches of more than one different volumetric video data component; and transmit the encoded bitstream for rendering.
 24. The apparatus according to claim 23, wherein the volumetric video data component comprises one of the following: geometry data, or attribute data.
 25. The apparatus according to claim 23, wherein said signaling is configured to be provided in at least one structure of a video-based point cloud compression bitstream.
 26. The apparatus according to claim 23, wherein the bitstream comprises a signal indicating a linkage between atlas data and packed video data.
 27. The apparatus according to claim 23, wherein the apparatus is further caused to: encode a type of the volumetric video data component into atlas data.
 28. The apparatus according to claim 23, wherein the apparatus is further caused to: map a patch to video frame packing regions signaled in the bitstream.
 29. The apparatus according to claim 23, wherein the apparatus is further caused to: indicate in the bitstream that the video frame comprises patches comprising more than one attribute data.
 30. The apparatus according to claim 29, wherein the attribute data comprises at least one of the following: texture, material identification, transparency, reflectance, or normal.
31. The apparatus according to claim 23, wherein the apparatus is further caused to: encode into the bitstream an indication of how patches are differentiated and linked together.
 32. The apparatus according to claim 23, wherein the apparatus is further caused to: generate a structure comprising information about packing regions.
 33. The apparatus according to claim 23, wherein the apparatus is further caused to: encode the video frame as separate color planes.
 34. The apparatus according to claim 23, wherein the apparatus is further caused to: encode into a bitstream information about a codec being used for encoding the video frame.
 35. The apparatus according to claim 23, wherein the apparatus is further caused to: generate a structure identifying the encoded bitstream to which metadata is related.